Abstract

This paper reports Part I of a two-part effort intended to
delineate the relationship between reliability and fault
tolerant control in a quantitative manner. Reliability analysis
of fault-tolerant control systems is performed using Markov
models. Reliability properties peculiar to fault-tolerant
control systems are emphasized. As a consequence, coverage of
failures through redundancy management can be severely limited.
It is shown that in the early life of a system composed of
highly reliable subsystems, the reliability of the overall
system is affine with respect to coverage, and inadequate
coverage induces dominant single point failures. The utility of
some existing software tools for assessing the reliability of
fault tolerant control systems is also discussed. Coverage
modeling is attempted in Part II in a way that captures its
dependence on the control performance and on the diagnostic
resolution.


1 Introduction 

Highly reliable systems make use of redundancy to achieve fault
tolerance, due to the limited reliability of components or
subsystems. Utilization of analytic redundancy [5], that is, the
redundancy provided by static and dynamic relations among system
variables, such as secondary functions of effectors, virtual
measurements, projections, etc., can further reduce the
probability of exhaustion of hardware in a cost-effective
manner. Analytic redundancy management of complex control
systems, however, involves considerably more risk than such
schemes as majority voting, for decision making is often based
on residual signals formed by the differences between noisy
measurements and values of output variables calculated from
inaccurate models. Decision errors can be associated with
uncertainties on whether there is a subsystem failure, which
subsystem has failed, how severe its effect is, whether it is
necessary to take a drastic corrective action, and which action
to take. In addition, the question may also arise of whether
there is adequate control-relevant redundancy [15] and authority
to allow recovery from the effect of the failure. The dynamic
and closed-loop nature common to all control systems is the
source of additional difficulties, such as temporary masking of
the effect of subsystem failures, vagueness in the definition of
a system level failure in the context of control performance,
and the sometimes significant processing required to support the
redundancy management.

Definitions suggested in [9] on fault and failure are adopted
with a slight extension. A fault is an unpermitted deviation of
at least one characteristic property or variable of the system.
A failure is a permanent interruption of a system's ability to
perform a required function under specified operating
conditions. Note that a failure can also be defined at the
subsystem level. A fault may or may not lead to a failure.
Without loss of generality, a subsystem failure is assumed to
always lead to a system failure unless a successful management
of redundancy ensues. A system level failure is declared when
faults or subsystem failures cause the control performance of
the system to fall below the prescribed threshold. The
performance threshold can be set at two (or more) different
levels, each corresponding to a specific reliability
requirement. In aviation, for example, one level can be set by
the ability to carry out a normal mission (or mission abort in
terms of failure probability), and another can be set by the
ability to merely maintain the system stability needed for safe
landing (loss of control in terms of failure probability). This
paper will treat different reliability requirements in a unified
manner.

Reliability is naturally a subjective concern in the analysis
and design of fault-tolerant control systems. The few
publications that formally address this issue [10, 11, 13] have
confined the scope of discussion to reliability assessment for
dynamic systems subject to faults. Reliability is rarely
regarded as an objective criterion that guides a control system
design in an integrated manner. This predicament is due to the
difficulty in establishing a functional linkage between the
overall system reliability and the performance defined in the
conventional sense at the bottom level for controls and for
diagnosis. The only attempt prior to this work along this
direction is reported in [14], where such a linkage is
established through coverage under the possibilistic
formalization. The possibilistic formalization provides flexible
and usually more accurate descriptions of uncertainties, but
suffers from a lack of corresponding theoretical and numerical
means for reliability analysis. This paper is intended to
address the reliability issue of fault tolerant control systems
in the more familiar probabilistic formalization so that
existing tools and methodologies of reliability analysis can be
applied. The paper is organized as follows. Section 2 presents
issues encountered in our endeavor through a reliability
analysis case study of a fault tolerant flight control system.
Section 3 discusses numerical techniques for reliability
assessment of fault tolerant systems with emphasis on how
coverage enters reliability as a decision risk factor. Several
approximate relations are derived to reveal the dependence of
reliability on coverage in simple forms.


2 A Case Study 

To understand some of the reliability issues peculiar to 
fault tolerant control systems, we start with a reliability 
assessment case study of a fault tolerant flight control 
system (FTFCS). A complete report on this case can 
be found in [13]. Observations that follow should serve 
to motivate a focused effort in coverage modeling of 
fault tolerant control systems. 
Fig. 1 Effector functional dependency in a FTFCS

Fig. 1 shows the functional dependencies of subsystems in the
pitch/roll control effector block within a fault tolerant flight
control system. The diagram reflects the available redundant
lateral control authorities in the system and the extent to
which such redundancy is utilized for subsystem failure
recovery. Each effector channel contains an actuator subsystem,
which is preceded by a group of three or four active identical
computer/effector interface subsystems and followed by a control
surface. Two of the effectors are considered primary, and two
secondary. Every computer/effector interface subsystem block is
of n-plex architecture (a group of n active identical
subsystems). Other blocks that precede the lateral-directional
effector block but are not shown in the figure include a
computer power supply block, an I/O control module block, a
pilot command sensor block, and an aircraft state sensor block.
The block following this effector block is a roll control
effector block. The functional dependency of the fault tolerant
flight control system altogether is described by a two-layer
parallel-to-series interconnection scheme.

The reliability indicator used in the following discussion is
the probability of loss of control, denoted by $P_{LOC}$. Each
small box in Fig. 1 represents a subsystem, where $\lambda_X$
($X = I, A, S$) are the failure rates in failures per hour.
Under the assumption of low subsystem failure rates, short
mission time, and highly rigorous maintenance requirements,
constant failure rates are appropriate. The safety requirement
for the inner layer parallel configuration (the n-plex
computer/effector interface subsystem) considered is
1-out-of-n. The safety requirement for the outer layer parallel
configuration is 3-out-of-4. This means that the three remaining
effector channels must work in concert to accommodate a failure
in one effector channel.
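As a rough illustration of what the 1-out-of-n and 3-out-of-4 requirements mean numerically, the following Python sketch evaluates the failure probability of a k-out-of-n group of identical subsystems. It assumes independent subsystems, a constant failure rate, and perfect coverage, so it describes the effective redundancy configuration only; the function names and the numerical values are illustrative assumptions, not taken from the case study.

```python
import math


def subsystem_failure_prob(rate_per_hour, hours):
    """P(one subsystem has failed by `hours`) under a constant failure rate."""
    # expm1 keeps precision when rate * hours is tiny, as it is here.
    return -math.expm1(-rate_per_hour * hours)


def k_out_of_n_failure_prob(k, n, rate_per_hour, hours):
    """P(fewer than k of n identical, independent subsystems are intact)."""
    q = subsystem_failure_prob(rate_per_hour, hours)
    r = 1.0 - q
    # Sum over i = number of intact subsystems, for every i below the requirement k.
    return sum(math.comb(n, i) * r**i * q**(n - i) for i in range(k))


# Illustrative values only: 1e-4 failures per hour, one-hour mission.
inner = k_out_of_n_failure_prob(1, 4, 1e-4, 1.0)  # 1-out-of-4 requirement
outer = k_out_of_n_failure_prob(3, 4, 1e-4, 1.0)  # 3-out-of-4 requirement
```

With perfect coverage the 3-out-of-4 group is far less reliable than the 1-out-of-4 group, since it fails as soon as two members are lost.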

The redundancy architecture shown in Fig. 1 does not truly
reflect how effector channel hardware is configured. It must be
understood as an effective redundancy configuration, which
assumes that any anomaly in an effector channel serious enough
to warrant a control adaptation or reconfiguration action for
failure accommodation is handled promptly and successfully. In
reality, however, due to uncertainties in the model of the
system to be controlled, uncertainties in the models of signals
exerted on the system, and limited processing capability,
considerable risks exist in making a decision on a corrective
action. These decision risks must be taken into consideration in
reliability assessment. The risks encountered may include overly
slow or severe transients, false alarm, missed detection, false
identification, false reconfiguration, and lack or exhaustion of
redundancy. The notion of coverage is now used to account for
such risks. It represents an attempt to separate the handling of
failures from the occurrence of failures. Coverage defined in
this context is highly scenario dependent, highly time
dependent, and, most of all, difficult to estimate. Coverage has
been used as a parameter to reflect the ability of a system to
automatically recover from the occurrence of a fault during
normal system operation:

Coverage = Probability(System recovers | Fault occurs).

Once a decision is made, however, the process of removing a
subsystem or reconfiguring the system is generally involved.
This process, though fast in comparison with a failure process,
still takes time, and its duration has been shown to be
generally non-exponentially distributed. Including this process
in a reliability model implies the creation of a numerically
stiff problem.


Some results of reliability assessment for the system of Fig. 1
are now presented. All coverage values are obtained from test
data, which aggregate the effects of decision errors. Since
these values are fixed, they are called static coverage values.
A coverage value of 0.99 is used when an actuator failure is
accommodated. The following table gives the coverage of a
computer/effector interface subsystem failure.


Redundancy management    Intact subsystems    Coverage
Majority voting          4                    0.992
Majority voting          3                    0.99
Comparing                2                    0.89
Self-monitoring          1                    0.75

Table 1

Coverage associated with surface damage is left as a variable
whose required value is yet to be determined, because it is
where improvement is needed most. A realistic estimate of static
coverage can be obtained with a full scale simulator by counting
the number of unsuccessful surface failure recoveries and taking
the ratio with respect to the total number of simulated surface
impairments; the coverage estimate is one minus this ratio. Note
that such static coverage values generalize from a rather small
sample of coverage data to a general population; they do not
address a specific process well and are therefore inadequate for
making online decisions. Part II will discuss coverage modeling
for more accurate coverage prediction.
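The counting estimate described above, and the reason a small sample supports only a loose statement about coverage, can be sketched in Python as follows. The trial counts are hypothetical, and the Wilson score interval is one standard choice for a binomial proportion that is swapped in here for illustration; the source does not say how uncertainty in static coverage was assessed.

```python
import math


def static_coverage(unsuccessful, total):
    """Coverage estimate: one minus the fraction of unsuccessful recoveries."""
    return 1.0 - unsuccessful / total


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return center - half, center + half


# Hypothetical campaign: 100 simulated surface impairments, 1 unsuccessful recovery.
estimate = static_coverage(1, 100)
low, high = wilson_interval(99, 100)
```

Under these assumed counts the point estimate is 0.99, but the interval is several percentage points wide, which is large compared with the coverage differences that matter in Table 3.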


The approximate parameter ranges in the Markov model used in our
case study are now given. The overall system reliability is
required to achieve $1 - 10^{-\ldots}$.

Parameter                                   Range                         Unit
Subsystem failure rate $\lambda_X$          $10^{-6} \sim 10^{-4}$        hour$^{-1}$
Subsystem mean time to recover $\mu_j$      $10^{-3} \sim 10^{-4}$        hours
Variance of time to recover $\sigma_j$      $10^{\ldots} \sim 10^{-4}$    hours
Mission time $T$                            $10^{0} \sim 10^{1}$          hours

Table 2


The above table reflects two common characteristics of highly
reliable fault tolerant systems: details due to small failure
probabilities cannot be arbitrarily ignored, and the recovery
process is much faster than the failure process (at least
$10^7$ times faster). As a result, one is faced with solving a
numerically stiff problem. Fortunately, successful attempts have
been made to deal effectively with the stiff problem both
theoretically and numerically.


Under a set of given failure rates and mission time, the
following results are obtained for the effector block.

Surface failure coverage    Approximate $P_{LOC}$
100%                        $10^{-10}$
99%                         $10^{-7}$
85%                         $10^{-6}$

Table 3


Though it has been observed that the use of analytic redundancy
can greatly increase the overall system reliability ($10^1$ to
$10^4$ times), imperfect coverage clearly has a dominating
effect on system reliability. It is found numerically that
$P_{LOC}$ decreases linearly with increasing surface damage
coverage up to an almost perfect coverage value. It is also
found that reducing the redundancy of the computer/effector
interface from quadruplex to triplex slightly increases the
overall system reliability.


These claims will be affirmed through analytical means in the
next section. The potential benefit of enhancing coverage and
the potential cost of additional hardware redundancy give
sufficient motivation to investigate what factors affect
coverage, and in what ways coverage is affected, in a fault
tolerant control system.


3 Coverage in Reliability Assessment 

In this section, the development of the reliability model and
the numerical technique used for obtaining the results of the
previous section are presented. Several general results
regarding the critical role of coverage in reliability
assessment are then derived. Coverage modeling, calculation, and
its role in relating fault tolerant control to reliability are
discussed in Part II.

Reliability modeling can be regarded as a process of identifying
the structure function of a system comprised of N subsystems
with positive random lifetimes. The structure function defines a
mapping $\{0, 1\}^N \to \{0, 1\}$ [1]. Reliability assessment
can be regarded as a process of evaluating the mapping, given
state transition probabilities. A subsystem is in state "1"
(intact) before its lifetime and state "0" (failed) after its
lifetime. The fundamental assumption of a Markov process is that
the probability that a system will undergo a transition from one
state to another state depends only on the current state of the
system and not on any previous states the system may have
experienced. A Markov process where all state transition rates
are time-invariant is said to be homogeneous.
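To make the notion of a structure function concrete, the sketch below encodes the two-layer parallel-to-series scheme of Section 2 as a mapping from subsystem states to a system state in Python. It is a minimal illustration under the assumption of perfect coverage, so it captures the effective redundancy configuration only; the function names and the tuple layout of a channel are hypothetical.

```python
from itertools import product


def k_out_of_n(k):
    """Return a structure function that is 1 when at least k inputs are 1."""
    def phi(states):
        return 1 if sum(states) >= k else 0
    return phi


def channel_works(interfaces, actuator, surface):
    """Series connection: 1-out-of-n interface group, actuator, control surface."""
    return min(k_out_of_n(1)(interfaces), actuator, surface)


def effector_block_works(channels):
    """Outer layer: at least 3 of the 4 effector channels must work."""
    return k_out_of_n(3)([channel_works(*ch) for ch in channels])


healthy = ((1, 1, 1, 1), 1, 1)          # all interfaces, actuator, surface intact
degraded = ((0, 0, 0, 1), 1, 1)         # three interface failures, still working
lost_surface = ((1, 1, 1, 1), 1, 0)     # surface failed, channel lost

one_channel_lost = effector_block_works([healthy, degraded, healthy, lost_surface])
two_channels_lost = effector_block_works([healthy, lost_surface, healthy, lost_surface])

# Enumerate all state vectors of a 3-out-of-4 group that leave it working.
working_states = [x for x in product((0, 1), repeat=4) if k_out_of_n(3)(x) == 1]
```

Reliability assessment then amounts to attaching probabilities to the state vectors on which this mapping returns 0, which is where coverage enters in the remainder of the section.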

Keeping in mind the case study of the previous section, the
following assumptions are used in the subsequent development of
reliability models:

(a) all subsystems are operational at the onset;

(b) the failure probability of any given subsystem is
$1 - e^{-\lambda t}$, where $\lambda$ is the constant failure
rate of that subsystem;

(c) a failure in any subsystem is independent of that in




Given the large disparity between failure rates ($\lambda$) and
recovery rates ($1/\mu_j$) as shown in Table 2, it is meaningful
to examine the condition under which the recovery times in the
Markov model can be eliminated. The rationale for this intent
lies with the simplification to a homogeneous Markov process.
Suppose this elimination is allowed; the Markov model is then
simplified to that depicted in Fig. 3(a). The sum of the
probabilities of the death states that have been eliminated as a
result of ignoring the recovery time is now estimated. First, a
result on an approximate failure probability is given.


Theorem 1. Assume (a) through (h) hold for a k-out-of-n system.
In addition, $c_0 < 1$ and $n\lambda T \ll 1$. Then the system
failure probability is dominated by

$$\hat{p}_D(T) = n\lambda T(1 - c_0) \qquad (3)$$

if

$$1 - c_0 \gg \frac{(n-1)\lambda\mu[(1+\lambda T)^n - 1] + [(1+\lambda T)^n - (1 + n\lambda T)]}{n\lambda T(1 - \ldots)}, \qquad (4)$$

where

$$\mu = \max\{\mu_1, \mu_2, \ldots, \mu_{n-1}\}, \quad \mu_i = 0, \ \forall i > n - k. \qquad (5)$$

In this case the approximation error $|p_D(T) - \hat{p}_D(T)|$
satisfies

$$|p_D(T) - \hat{p}_D(T)| < \max\{(n-1)\lambda\mu[(1+\lambda T)^n - 1], \ \ldots\}. \qquad (6)$$

A key step in proving Theorem 1 is the application of White's
bounds [12], by which it is required that state transitions be
considered as separate transitions representing disjoint events
of traversing paths to exit states. The rest of the proof
involves adjusting the bounds using binomial forms. Due to the
paper length limit, proofs of all theorems are omitted.

Essentially, Theorem 1 states that if $c_0$ is not sufficiently
close to 1 in the sense defined above, the failure probability
of the Markov process becomes linear with respect to
$\lambda T$ and to $1 - c_0$. The most important implication
here is that in order to take effective advantage of redundancy,
it is crucial to have the highest possible coverage for the
first failure that occurs in the system. As will be shown in
Part II, this can be achieved only through integrated design of
the entire system. To gain a sense of how far $c_0$ must be from
1 in order for the simple formula to be valid, values given in
Table 2 are used. With $n = 4$, $k = 1$, $T = 1$,
$\mu = 10^{-3}$, and $\lambda = 10^{-4}$,

$$1 - c_0 \gg \frac{3\lambda\mu[(1+\lambda T)^4 - 1] + [(1+\lambda T)^4 - (1 + 4\lambda T)]}{4\lambda T(1 - 2\lambda T)} = 0.00016$$

must be satisfied. Using $c_0 = 0.992$ from Table 1,
$1 - c_0 = 1 - 0.992 = 0.008 \gg 0.00016$. The following
approximation of the system failure probability has been
obtained:

$$p_D(T) \approx 3.2 \times 10^{-6},$$

with an approximation error bounded by $6.0 \times 10^{-8}$.

The inequality and error bound in Theorem 1 become more and more
conservative as k becomes larger than 1, for more terms are
added without being subtracted in completing the binomial forms.
When the $c_i$ are closer to 1, tighter bounds can be obtained
by considering $k\lambda c_i$ as a fast rate relative to
$k\lambda(1 - c_i)$.

The next result states the condition on the elimination of the
recovery times. This is equivalent to setting $\mu_i = 0$,
$\forall i$.

Theorem 2. Assume (a) through (h) hold for a k-out-of-n system.
In addition, $c_0 < 1$ and $n\lambda T \ll 1$. In addition,
assume $\hat{p}_D(T) = n\lambda T(1 - c_0)$ dominates $p_D(T)$.
Then the recovery times can be ignored in the system failure
probability calculation with an error bounded by

$$n\lambda\mu[(1+\lambda T)^n - 1], \quad \mu = \max\{\mu_1, \mu_2, \ldots, \mu_{n-1}\}, \qquad (7)$$

where $\mu_i = 0$, $\forall i > n - k$. In this case,

$$1 - c_0 \gg \frac{(n-1)\mu[(1+\lambda T)^n - 1]}{nT(1 - \ldots)}. \qquad (8)$$

The proof of Theorem 2 is similar to that of Theorem 1. In fact,
the above inequality is implied by (4). To see how easily this
condition is satisfied, the right hand side is calculated using
again $n = 4$, $k = 1$, $\mu = 0.001$, $T = 1$, and
$\lambda = 10^{-4}$, which leads to
$1 - c_0 \gg 3 \times 10^{-7}$. This is met if
$1 - c_0 = 10^{-5}$, or $c_0 = 0.99999$. In comparison with the
value $c_0 = 0.992$ used, the condition for eliminating recovery
time is well satisfied. Therefore, whenever the failure
probability is dominated by $n\lambda T(1 - c_0)$, elimination
of recovery time is permissible. In the case of the system of
Fig. 2, $p_D(T)$ can now be expressed analytically as

$$p_D(T) \approx 4\lambda T(1 - c_0) = \hat{p}_D(T).$$
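The numbers in the $n = 4$, $k = 1$ example can be checked with a short Python sketch. It evaluates formula (3) and the $n = 4$ form of the right hand side of condition (4), and compares formula (3) with the failure probability of a small homogeneous Markov chain in which recovery times are ignored. The chain layout (states indexed by the number of intact subsystems, Table 1 coverage applied at each covered transition, loss of the last subsystem always fatal) is an assumption made for illustration and is not necessarily the model of Fig. 2 or Fig. 3(a); all function names are hypothetical.

```python
LAM = 1e-4  # subsystem failure rate per hour, as in the example
T = 1.0     # mission time in hours, as in the example
MU = 1e-3   # mean time to recover in hours, as in the example
COVERAGE = {4: 0.992, 3: 0.99, 2: 0.89}  # Table 1, keyed by intact subsystems


def theorem1_threshold_n4(lam, mu, t):
    """Right-hand side of condition (4) in its n = 4 form quoted in the text."""
    numerator = (3 * lam * mu * ((1 + lam * t) ** 4 - 1)
                 + ((1 + lam * t) ** 4 - (1 + 4 * lam * t)))
    return numerator / (4 * lam * t * (1 - 2 * lam * t))


def first_order_failure_prob(n, lam, t, c0):
    """Formula (3): n * lambda * T * (1 - c0)."""
    return n * lam * t * (1 - c0)


def chain_failure_prob(n, lam, t, coverage, terms=12):
    """Failure probability of an ASSUMED homogeneous chain, recovery ignored.

    States are n, n-1, ..., 1 intact subsystems, then one absorbing failure state.
    """
    size = n + 1
    fail = size - 1
    q = [[0.0] * size for _ in range(size)]
    for row, intact in enumerate(range(n, 0, -1)):
        c = coverage[intact] if intact > 1 else 0.0  # last loss is never covered
        if intact > 1:
            q[row][row + 1] = intact * lam * c       # covered failure
        q[row][fail] += intact * lam * (1 - c)       # uncovered failure
        q[row][row] = -intact * lam                  # total exit rate
    # Row vector P(0) exp(Q t) by a truncated Taylor series; Q t is tiny here.
    term = [1.0] + [0.0] * (size - 1)
    prob = term[:]
    for j in range(1, terms + 1):
        term = [sum(term[i] * q[i][k] for i in range(size)) * t / j
                for k in range(size)]
        prob = [p + x for p, x in zip(prob, term)]
    return prob[fail]


approx = first_order_failure_prob(4, LAM, T, COVERAGE[4])
exact = chain_failure_prob(4, LAM, T, COVERAGE)
threshold = theorem1_threshold_n4(LAM, MU, T)
```

Under these assumptions `approx` reproduces the quoted $3.2 \times 10^{-6}$, `threshold` is of the order of $10^{-4}$ and well below $1 - c_0 = 0.008$, and the gap between `exact` and `approx` stays inside the quoted error bound of $6.0 \times 10^{-8}$.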

We now proceed to demonstrate the solution procedure for the
more complex two-layer parallel-to-series interconnection scheme
encountered in the case study of Section 2. Since the achievable
$1 - c_0$ in our case study satisfies the condition of (4), the
effector block failure probability can be approximated, after
the application of (3) to both the inner and the outer layers of
parallel interconnections, by

$$p_D(T) \approx 4\{\lambda_I T n(1 - c_0^I) + \lambda_A T(1 - c_0^A) + \lambda_S T(1 - c_0^S)\}, \qquad (9)$$

where n (= 1, 2, 3, or 4) is the redundancy level in the
computer/effector interface portion. This formula holds when
$1 - c_0^X \gg \lambda_X T$, $X = I, A, S$. In particular,
improvement in coverage, even by a small percentage (from 0.99
to 0.999, for example), could reduce the system failure
probability by an order of magnitude. On the other hand, the
level of redundancy in the interface portion in each effector
channel deserves consideration for optimization. The following
table summarizes the contribution of the first term in (9) (the
interface portion) to system failure probability.



n    Redundancy management    Coverage $c_0^n$    Effect on $p_D$
4    Majority voting          0.992               $4\lambda T \times 0.032$
3    Majority voting          0.99                $4\lambda T \times 0.030$
2    Comparing                0.89                $4\lambda T \times 0.220$
1    Self-monitoring          0.75                $4\lambda T \times 0.25$

Table 4

The numbers in the last column are the products of n and
$1 - c_0^n$ for different redundancy levels n. Since $c_0^n$
increases with n as shown in the table, $1 - c_0^n$ decreases
while the factor n grows, and it turns out that the minimum of
their product appears at n = 3, i.e., the 3-plex interface
architecture minimizes the system failure probability.
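The last column of Table 4 can be reproduced directly from the coverage values. The Python sketch below computes the product of the redundancy level and the uncovered fraction for each row and picks the level with the smallest product; the variable and function names are illustrative.

```python
# Coverage of a first interface failure, keyed by redundancy level n (Table 4).
COVERAGE = {4: 0.992, 3: 0.99, 2: 0.89, 1: 0.75}


def interface_factor(n, coverage):
    """Multiplier of 4 * lambda * T in the interface term of (9): n * (1 - c)."""
    return n * (1.0 - coverage)


factors = {n: interface_factor(n, c) for n, c in COVERAGE.items()}
best_n = min(factors, key=factors.get)  # redundancy level with the smallest factor
```

The margin between the triplex and quadruplex factors is small (0.030 against 0.032), so the ranking is sensitive to the third decimal of the coverage values.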


Enhanced coverage has been shown to be the key to enhanced
system reliability. There are applications, such as civil
aviation, where the system reliability requirement is as
stringent as, for example, $1 - 10^{-\ldots}$. Given the
limitation of individual subsystem reliability, the analysis of
this section concludes that coverage of first subsystem failures
in such systems must be raised to a value extremely close to 1
to avoid inducing dominant single point failures. At this
extremely high coverage, the approximate formulae given in this
section are no longer accurate, and the use of an elaborate and
rigorous numerical tool such as WinSURE becomes necessary. (For
the data given in Table 2, however, the failure probability
calculated by the above approximate formula is indistinguishable
from the upper and the lower bounds given by ASSIST [3] and
SURE [4].) High coverage, at the same time, imposes extremely
stringent requirements on redundancy management. Such
requirements must be reflected at the bottom levels in the
control and diagnostic performance requirements, which will be
discussed in Part II of the paper.


4 Conclusions 

The main contributions of the paper are presented in 
Theorems 1 and 2. 

Theorem 1 states that when coverage is not sufficiently 
high, the uncovered subsystem failures dominate the 
system failure, and the system failure probability in- 
creases linearly with decreasing coverage values. This 
can significantly undermine the benefit of using redun- 
dancy. Therefore, every effort should be made to en- 
hance coverage of first subsystem failures. Theorem 
2 states that when the uncovered failures are domi- 
nating, the recovery times can be ignored if they are 
several orders of magnitude faster than the subsystem 
failure times on average. In this case, a numerically 
stiff problem is avoided, and reliability analysis of a 
complex system can be much simplified. 

It is necessary to point out that the motivating force of this
work comes from the set goals of the ongoing NASA/FAA aviation
safety program [2]. Though the main conclusions drawn in this
paper should hold for many areas of application, the reader is
cautioned to pay attention to the stated conditions upon which
the conclusions are drawn, especially when they are applied to
problems of vastly different time scales.


References 

[1] Aven, T., and Jensen, U., Stochastic Models in Reliability,
Springer-Verlag, 1999.

[2] Belcastro, C., and Belcastro, C., Application of failure
detection, identification, and accommodation methods for
improved aircraft safety, to appear in Proc. American Control
Conference, 2001.

[3] Butler, R.W., An abstract language for specifying Markov
reliability models, IEEE Trans. on Reliability, vol. R-35,
pp. 595-601, 1986.

[4] Butler, R.W., The SURE approach to reliability analysis,
IEEE Trans. on Reliability, vol. 41, pp. 210-218, 1992.

[5] Chow, E.Y., and Willsky, A.S., Analytical redundancy and the
design of robust detection systems, IEEE Trans. on Automatic
Control, vol. 29, pp. 603-614, 1984.

[6] Dugan, and Trivedi, Coverage modeling for dependability
analysis of fault tolerant systems, IEEE Trans. on Computers,
vol. 38, pp. 775-787, 1989.

[7] Elsayed, A.E., Reliability Engineering, Addison-Wesley,
1996.

[8] Howard, R.A., Dynamic Probabilistic Systems: V1, Markov
Models, V2, Semi-Markov and Decision Processes, Wiley, 1971.

[9] Isermann, R., and Balle, P., Trends in the application of
model-based fault detection and diagnosis of technical
processes, Control Engineering Practice, vol. 5, pp. 709-719,
1997.

[10] Van Schrick, D., and Muller, P., Reliability models for
sensor fault detection with state estimator schemes, Chapter 8
in Issues of Fault Diagnosis for Dynamic Systems (Patton, Frank,
Clark, eds.), Springer-Verlag, 2000.

[11] Walker, B., Fault tolerant control system reliability and
performance prediction using semi-Markov models, Proc.
Safeprocess, 1997.

[12] White, A.L., Reliability estimation for reconfigurable
systems with fast recovery, Microelectronics Reliability,
vol. 26, pp. 1111-1120, 1986.

[13] Wu, N.E., and Chen, T.J., Reliability prediction for
self-repairing flight control systems, Proc. 35th IEEE
Conference on Decision and Control, Kobe, Japan, Dec. 1996.

[14] Wu, N.E., and Klir, G.J., Optimal redundancy management in
reconfigurable control systems based on normalized
nonspecificity, International Journal of Systems Science,
vol. 31, pp. 797-808, 2000.

[15] Wu, N.E., Zhou, K., and Salomon, G., Reconfigurability in
linear time-invariant systems, Automatica, vol. 36,
pp. 1767-1771, 2000.

[16] Wu, N.E., Hardware reduction through use of control
surface redundancy, Technical report to General Electric
Company, 1991.



